AITopics | ai judge

ai judge

AI Debate Aids Assessment of Controversial Claims

Neural Information Processing Systems, Jun-14-2026, 07:51:43 GMT

As AI grows more powerful, it will increasingly shape how we understand the world. But with this influence comes the risk of amplifying misinformation and deepening social divides, especially on consequential topics where factual accuracy directly affects well-being. Scalable oversight aims to ensure AI systems remain truthful even when their capabilities exceed those of their evaluators. Yet when humans serve as evaluators, their own beliefs and biases can impair judgment. We study whether AI debate can guide biased judges toward the truth by having two AI systems debate opposing sides of controversial factuality claims on COVID-19 and climate change, topics where people hold strong prior beliefs.

artificial intelligence, name change, proceedings, (11 more...)

Neural Information Processing Systems

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.43)
Health & Medicine > Therapeutic Area > Immunology (0.43)

Technology: Information Technology > Artificial Intelligence (1.00)

HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model

Lee, Youngwan, Kim, Kangsan, Park, Kwanyong, Jung, Ilchae, Jang, Soojin, Lee, Seanie, Lee, Yong-Ju, Hwang, Sung Ju

arXiv.org Artificial Intelligence, Nov-26-2025

Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench itself reveals critical vulnerabilities in existing VLM models. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.

category, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2506.04704

Genre: Research Report > New Finding (0.45)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Consumer Health (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs

Yu, Fangyi

arXiv.org Artificial Intelligence, Aug-6-2025

As large language models (LLMs) grow in capability and autonomy, evaluating their outputs, especially in open-ended and complex tasks, has become a critical bottleneck. A new paradigm is emerging: using AI agents as the evaluators themselves. This "agent-as-a-judge" approach leverages the reasoning and perspective-taking abilities of LLMs to assess the quality and safety of other models, promising scalable and nuanced alternatives to human evaluation. In this review, we define the agent-as-a-judge concept, trace its evolution from single-model judges to dynamic multi-agent debate frameworks, and critically examine their strengths and shortcomings. We compare these approaches across reliability, cost, and human alignment, and survey real-world deployments in domains such as medicine, law, finance, and education. Finally, we highlight pressing challenges, including bias, robustness, and meta-evaluation, and outline future research directions. By bringing together these strands, our review demonstrates how agent-based judging can complement (but not replace) human oversight, marking a step toward trustworthy, scalable evaluation for next-generation LLMs.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2508.02994

Country:

North America > United States (0.29)
North America > Mexico (0.28)
Europe > Austria (0.28)

Genre:

Overview (1.00)
Research Report > New Finding (0.46)

Industry:

Health & Medicine (1.00)
Education (1.00)
Law > Litigation (0.46)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Divergent Realities: A Comparative Analysis of Human Expert vs. Artificial Intelligence Based Generation and Evaluation of Treatment Plans in Dermatology

Sengupta, Dipayan, Panda, Saumya

arXiv.org Artificial Intelligence, Jul-9-2025

Background: Evaluating AI-generated treatment plans is a key challenge as AI expands beyond diagnostics, especially with new reasoning models. This study compares plans from human experts and two AI models (a generalist and a reasoner), assessed by both human peers and a superior AI judge. Methods: Ten dermatologists, a generalist AI (GPT-4o), and a reasoning AI (o3) generated treatment plans for five complex dermatology cases. The anonymized, normalized plans were scored in two phases: 1) by the ten human experts, and 2) by a superior AI judge (Gemini 2.5 Pro) using an identical rubric. Results: A profound 'evaluator effect' was observed. Human experts scored peer-generated plans significantly higher than AI plans (mean 7.62 vs. 7.16; p=0.0313), ranking GPT-4o 6th (mean 7.38) and the reasoning model, o3, 11th (mean 6.97). Conversely, the AI judge produced a complete inversion, scoring AI plans significantly higher than human plans (mean 7.75 vs. 6.79; p=0.0313). It ranked o3 1st (mean 8.20) and GPT-4o 2nd, placing all human experts lower. Conclusions: The perceived quality of a clinical plan is fundamentally dependent on the evaluator's nature. An advanced reasoning AI, ranked poorly by human experts, was judged as superior by a sophisticated AI, revealing a deep gap between experience-based clinical heuristics and data-driven algorithmic logic. This paradox presents a critical challenge for AI integration, suggesting the future requires synergistic, explainable human-AI systems that bridge this reasoning gap to augment clinical care.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.05716

Country: Asia > India (0.15)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.87)

Industry: Health & Medicine > Therapeutic Area > Dermatology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)

Enhancing Selection of Climate Tech Startups with AI -- A Case Study on Integrating Human and AI Evaluations in the ClimaTech Great Global Innovation Challenge

Turliuk, Jennifer, Sevilla, Alejandro, Gorza, Daniela, Hynes, Tod

arXiv.org Artificial Intelligence, May-29-2025

This case study examines the ClimaTech Great Global Innovation Challenge's approach to selecting climate tech startups by integrating human and AI evaluations. The competition aimed to identify top startups and enhance the accuracy and efficiency of the selection process through a hybrid model. Research shows data-driven approaches help VC firms reduce bias and improve decision-making. Machine learning models have outperformed human investors in deal screening, helping identify high-potential startups. Incorporating AI aimed to ensure more equitable and objective evaluations. The methodology included three phases: initial AI review, semi-finals judged by humans, and finals using a hybrid weighting. In phase one, 57 applications were scored by an AI tool built with StackAI and OpenAI's GPT-4o, and the top 36 advanced. In the semi-finals, human judges, unaware of AI scores, evaluated startups on team quality, market potential, and technological innovation. Each score, whether human or AI, was weighted equally, resulting in 75 percent human and 25 percent AI influence. In the finals, with five human judges, weighting shifted to 83.3 percent human and 16.7 percent AI. There was a moderate positive correlation between AI and human scores (Spearman's rho = 0.47), indicating general alignment with key differences. Notably, the final four startups, selected mainly by humans, were among those rated highest by the AI. This highlights the complementary nature of AI and human judgment. The study shows that hybrid models can streamline and improve startup assessments. The ClimaTech approach offers a strong framework for future competitions by combining human expertise with AI capabilities.
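
The human/AI percentages quoted in this abstract follow from simple counting when every score carries the same weight. The sketch below is illustrative only and is not the authors' tool. It assumes a single AI score per startup and three human judges in the semi-finals; the latter is inferred from the 75/25 split and is not stated in the abstract. The function names human_share and hybrid_score, and the example scores, are hypothetical.

```python
def human_share(num_human_judges: int, num_ai_judges: int = 1) -> float:
    """Fraction of total influence held by humans when all scores are weighted equally."""
    total = num_human_judges + num_ai_judges
    if total == 0:
        raise ValueError("at least one judge is required")
    return num_human_judges / total


def hybrid_score(human_scores, ai_scores):
    """Equal-weight hybrid score: the plain mean over every human and AI score."""
    scores = list(human_scores) + list(ai_scores)
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)


# Semi-finals (assumed: three human judges plus one AI score).
print(f"{human_share(3):.1%} human")  # 75.0% human
# Finals (five human judges, per the abstract, plus one AI score).
print(f"{human_share(5):.1%} human")  # 83.3% human

# Made-up scores for one startup: three human judges and one AI score.
print(hybrid_score([8.0, 7.0, 9.0], [6.0]))  # 7.5
```

Under this equal-weight scheme, each additional human judge dilutes the AI's influence, which is why the AI share drops from 25 percent to 16.7 percent between the two rounds.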

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2505.21562

Genre: Research Report > Experimental Study (0.66)

Industry:

Banking & Finance > Trading (0.48)
Banking & Finance > Capital Markets (0.34)

Technology:

Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.62)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.55)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

AI Judges in Design: Statistical Perspectives on Achieving Human Expert Equivalence With Vision-Language Models

Edwards, Kristen M., Tehranchi, Farnaz, Miller, Scarlett R., Ahmed, Faez

arXiv.org Artificial Intelligence, Apr-1-2025

The subjective evaluation of early stage engineering designs, such as conceptual sketches, traditionally relies on human experts. However, expert evaluations are time-consuming, expensive, and sometimes inconsistent. Recent advances in vision-language models (VLMs) offer the potential to automate design assessments, but it is crucial to ensure that these AI "judges" perform on par with human experts. However, no existing framework assesses expert equivalence. This paper introduces a rigorous statistical framework to determine whether an AI judge's ratings match those of human experts. We apply this framework in a case study evaluating four VLM-based judges on key design metrics (uniqueness, creativity, usefulness, and drawing quality). These AI judges employ various in-context learning (ICL) techniques, including uni- vs. multimodal prompts and inference-time reasoning. The same statistical framework is used to assess three trained novices for expert-equivalence. Results show that the top-performing AI judge, using text- and image-based ICL with reasoning, achieves expert-level agreement for uniqueness and drawing quality and outperforms or matches trained novices across all metrics. In 6/6 runs for both uniqueness and creativity, and 5/6 runs for both drawing quality and usefulness, its agreement with experts meets or exceeds that of the majority of trained novices. These findings suggest that reasoning-supported VLM models can achieve human-expert equivalence in design evaluation. This has implications for scaling design evaluation in education and practice, and provides a general statistical framework for validating AI judges in other domains requiring subjective content evaluation.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2504.00938

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Pennsylvania > Centre County > University Park (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(6 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.92)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Applied AI (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
(2 more...)

Reasons to Reject? Aligning Language Models with Judgments

Xu, Weiwen, Cai, Deng, Zhang, Zhisong, Lam, Wai, Shi, Shuming

arXiv.org Artificial Intelligence, Dec-22-2023

As humans, we consistently engage in interactions with our peers and receive feedback in the form of natural language. This language feedback allows us to reflect on our actions, maintain appropriate behavior, and rectify our errors. The question arises naturally: can we use language feedback to align large language models (LLMs)? In contrast to previous research that aligns LLMs with reward or preference data, we present the first systematic exploration of alignment through the lens of language feedback (i.e., judgment). We commence with an in-depth investigation of potential methods that can be adapted for aligning LLMs with judgments, revealing that these methods are unable to fully capitalize on the judgments. To facilitate more effective utilization of judgments, we propose a novel framework, Contrastive Unlikelihood Training (CUT), that allows for fine-grained inappropriate content detection and correction based on judgments. Our offline alignment results show that, with merely 1317 off-the-shelf judgment data, CUT (LLaMA2-13b) can beat the 175B DaVinci003 and surpass the best baseline by 52.34 points on AlpacaEval. The online alignment results demonstrate that CUT can align LLMs (LLaMA2-chat-13b) in an iterative fashion using model-specific judgment data, with a steady performance improvement from 81.09 to 91.36 points on AlpacaEval. Our analysis further suggests that judgments exhibit greater potential than rewards for LLM alignment and warrant future research.

arxiv preprint, judgment, llm, (14 more...)

arXiv.org Artificial Intelligence

2312.14591

Country:

North America > United States > Alabama (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Wyoming (0.04)
(9 more...)

Genre: Research Report > New Finding (1.00)

Industry: Leisure & Entertainment > Sports > Football (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Human or Machine? Turing Tests for Vision and Language

Zhang, Mengmi, Dellaferrera, Giorgia, Sikarwar, Ankur, Armendariz, Marcelo, Mudrik, Noga, Agrawal, Prachi, Madan, Spandan, Barbu, Andrei, Yang, Haochen, Kumar, Tanishq, Sadwani, Meghna, Dellaferrera, Stella, Pizzochero, Michele, Pfister, Hanspeter, Kreiman, Gabriel

arXiv.org Artificial Intelligence, Nov-23-2022

As AI algorithms increasingly participate in daily activities that used to be the sole province of humans, we are inevitably called upon to consider how much machines are really like us. To address this question, we turn to the Turing test and systematically benchmark current AIs in their abilities to imitate humans. We establish a methodology to evaluate humans versus machines in Turing-like tests and systematically evaluate a representative set of selected domains, parameters, and variables. The experiments involved testing 769 human agents, 24 state-of-the-art AI agents, 896 human judges, and 8 AI judges, in 21,570 Turing tests across 6 tasks encompassing vision and language modalities. Surprisingly, the results reveal that current AIs are not far from being able to impersonate human judges across different ages, genders, and educational levels in complex visual and language challenges. In contrast, simple AI judges outperform human judges in distinguishing human answers from machine answers. The curated large-scale Turing test datasets introduced here and their evaluation metrics provide valuable insights to assess whether an agent is human or not. The proposed formulation to benchmark human imitation ability in current AIs paves a way for the research community to expand Turing tests to other research areas and conditions. All source code and data are publicly available at https://tinyurl.com/8x8nha7p

artificial intelligence, machine learning, turing test, (16 more...)

arXiv.org Artificial Intelligence

2211.13087

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Michigan (0.04)
Europe > Russia (0.04)
(13 more...)

Genre:

Research Report (1.00)
Personal > Interview (0.93)

Industry:

Media (1.00)
Health & Medicine > Therapeutic Area (1.00)
Information Technology > Security & Privacy (0.93)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Issues > Turing's Test (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Weighing the Pros and Cons of AI Judges

#artificialintelligence, Jan-5-2020, 09:47:24 GMT

AI is set to replace many human jobs in the future, but should lawyers and judges be among them? Here we explore where AI is already being used in judicial systems around the world and discuss whether it should play a more senior role. Could, or should, AI ever be developed that could pass judgment on a living, breathing human being? Believe it or not, AI and some forms of advanced algorithms are already widely used in many judicial systems around the world. In various states within the United States, for example, predictive algorithms are already being used to help reduce the load on the judicial system.

algorithm, artificial intelligence, machine learning, (18 more...)

#artificialintelligence

Country:

North America > United States (0.49)
Asia > China > Beijing > Beijing (0.06)

Industry:

Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.30)

China unveils digital courts with AI judges and verdicts via apps

#artificialintelligence, Dec-25-2019, 14:31:04 GMT

China has developed mobile courts with artificial-intelligence judges and verdicts delivered through chat apps, aiming to deal with case backlogs. Litigants appear by video chat as an AI judge with an avatar hears the cases. The country is urging digitization to streamline case-handling within its court system using cyberspace technologies such as blockchain and cloud computing, according to the Supreme People's Court in a policy paper. The paper was released in the first week of December as judicial authorities gave journalists a sneak peek of the country's first cyber court, which was established in 2017 in Hangzhou city. Social media platform WeChat has reportedly already handled over three million legal cases or other judicial procedures since its launch in March.

ai judge, china unveil digital court, judge and verdict, (5 more...)

#artificialintelligence

Country: Asia > China > Zhejiang Province > Hangzhou (0.34)

Industry:

Law (1.00)
Information Technology > Services (0.43)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence (0.73)
